Let’s run some basic functions to examine the structure and schema of the data set.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Initial observations are:

There are 1599 observations of 13 variables. X appears to be a unique row identifier. Quality is an ordered, discrete, categorical variable; its values range only from 3 to 8, with a mean of 5.6 and a median of 6. From the variable descriptions, it appears that fixed.acidity and volatile.acidity may be related to each other, as may free.sulfur.dioxide and total.sulfur.dioxide (one may be a subset of the other).

Since we’re primarily interested in categorizing/modelling wines based on quality, it would make sense to convert X and quality into factor variables.

##  Factor w/ 1599 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
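
The conversion described above can be sketched roughly as follows. The data frame name `wine` and the six toy rows are assumptions made for illustration; the real data set has 1599 rows.

```r
# Toy stand-in for the wine data (hypothetical values, not the real data set)
wine <- data.frame(
  X       = 1:6,
  quality = c(5L, 5L, 6L, 7L, 3L, 8L)
)

# X is only a row identifier, so treat it as a plain (unordered) factor
wine$X <- factor(wine$X)

# quality is a rank, so make it an ordered factor over the observed range 3..8
wine$quality <- factor(wine$quality, levels = 3:8, ordered = TRUE)

str(wine$X)
str(wine$quality)
```

Declaring the levels explicitly keeps all six scores as levels even if a subset of the data happens to miss one of them.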

Univariate Plots Section

Initial analysis with histogram

Plotting a histogram and a box plot side by side for each variable to check the distribution of the data. Box plots give a very clear representation of the spread and outliers of each variable, so they are plotted alongside the histograms.
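
A minimal sketch of such a side-by-side plot, using base R graphics. The helper name `plot_distribution` is hypothetical, and the plots in this report may well have been produced with a different plotting package.

```r
# Draw a histogram and a box plot of one numeric vector side by side.
# `plot_distribution` is a hypothetical helper, not a function from the report.
plot_distribution <- function(x, name) {
  old_par <- par(mfrow = c(1, 2))   # one row, two panels
  on.exit(par(old_par))             # restore the previous layout afterwards
  h <- hist(x, main = paste("Histogram of", name), xlab = name, ylab = "count")
  b <- boxplot(x, main = paste("Box plot of", name), ylab = name)
  invisible(list(hist = h, box = b))  # return the underlying numbers as well
}
```

Assuming the data frame is called `wine`, a call such as `plot_distribution(wine$fixed.acidity, "fixed.acidity")` would draw both panels for one variable.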

Plotting fixed.acidity vs count

Since the fixed.acidity histogram/box plot is positively skewed, it was re-plotted using log10(fixed.acidity). The histogram of log10(fixed.acidity) is closer to a normal distribution, which is better suited for modelling.

Plotting volatile.acidity vs count

Since the volatile.acidity histogram/box plot is positively skewed, it was re-plotted using log10(volatile.acidity). The histogram of log10(volatile.acidity) is closer to a normal distribution, which is better suited for modelling.

Plotting citric.acid vs count

Since citric.acid is 0 for many data points, those points are removed before re-plotting the histogram of the distribution.

Citric acid seems to be an added ingredient, as the counts don’t seem to follow any pattern.

Plotting residual.sugar vs count

Since the residual.sugar histogram/box plot is positively skewed, it was re-plotted using log10(residual.sugar). The histogram of log10(residual.sugar) is closer to a normal distribution, which is better suited for modelling.

Plotting chlorides vs count

Since the chlorides histogram/box plot is positively skewed, it was re-plotted using log10(chlorides). The histogram of log10(chlorides) is closer to a normal distribution, which is better suited for modelling.

Plotting free.sulfur.dioxide vs count

Since the free.sulfur.dioxide histogram/box plot is positively skewed, it was re-plotted using log10(free.sulfur.dioxide). The histogram of log10(free.sulfur.dioxide) is closer to a normal distribution, which is better suited for modelling.

Plotting total.sulfur.dioxide vs count

Since the total.sulfur.dioxide histogram/box plot is positively skewed, it was re-plotted using log10(total.sulfur.dioxide). The histogram of log10(total.sulfur.dioxide) is closer to a normal distribution, which is better suited for modelling.

Plotting density vs count

It appears that density is normally distributed, with few outliers.

Plotting pH vs count

It appears that pH is normally distributed, with few outliers.

Plotting sulphates vs count

Since the sulphates histogram/box plot is positively skewed, it was re-plotted using log10(sulphates). The histogram of log10(sulphates) is closer to a normal distribution, which is better suited for modelling.

Plotting alcohol vs count

Since the alcohol histogram/box plot is positively skewed, it was re-plotted using log10(alcohol). The histogram of log10(alcohol) is closer to a normal distribution, which is better suited for modelling.

Plotting quality vs count

It appears that density and pH are normally distributed, with few outliers.

Fixed and volatile acidity, sulfur dioxides, and alcohol seem to be long-tailed. The volatile acidity distribution appears bimodal at 0.4 and 0.6 with some outliers in the higher ranges.

Qualitatively, sulphates, residual sugar and chlorides have extreme outliers.

Citric acid appeared to have a large number of zero values. I was curious whether these are actually zero, or a case of non-reporting. After reading about wine making, it became clear why the citric acid quantity is zero in many wines: citric acid is an added ingredient used to enhance the acidity of wines, and not all winemakers add it.

Categorizing wine quality as poor, average, or best, based on ranges of the quality score, makes the analysis more user friendly; the number of wine observations per category can then be checked via a histogram.

##    poor average    best 
##      63    1319     217
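
One way to build such a three-level rating is with `cut()`. The cut points used here (poor below 5, average 5 to 6, best 7 and above) are an assumption, since the text does not state them, and the eight scores are toy values.

```r
# Toy quality scores (hypothetical values, not the real data set)
quality <- c(3L, 4L, 5L, 5L, 6L, 6L, 7L, 8L)

# Intervals are closed on the right by default: (-Inf,4], (4,6], (6,Inf]
rating <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("poor", "average", "best"),
  ordered_result = TRUE
)

table(rating)
```

If quality has already been converted to a factor, it would first need to be turned back into numbers, for example with `as.numeric(as.character(quality))`.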

Adding a new variable called total.acidity, since fixed.acidity, volatile.acidity and citric.acid are all components of the wine’s overall acidity.
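
As a small sketch, the combined variable is just the row-wise sum of the three acid columns. The two rows below are the first two observations shown in the `str()` output above; the data frame name `wine` is an assumption.

```r
# First two observations from the str() output above
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8),
  volatile.acidity = c(0.70, 0.88),
  citric.acid      = c(0.00, 0.00)
)

# Sum of tartaric (fixed), acetic (volatile) and citric acid
wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity + wine$citric.acid

wine
```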

Distributions and Outliers

Density and pH are approximately normally distributed. All the other variables display positive skew.

If the distribution of a variable has a positive skew (a long-tailed histogram), taking the logarithm of the variable sometimes helps when fitting the variable into a model. Log transformations make positively skewed distributions more normal, as observed in the histograms above. Also, if most of the wine data fall within a certain range, it’s better to consider that range for modelling.
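
The effect of the log transformation can be checked numerically as well as visually. The sketch below uses synthetic log-normal data rather than the wine data, and the `skewness` helper is written out by hand as an illustration.

```r
# Sample skewness: the third standardized moment
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

set.seed(42)
x <- rlnorm(1000, meanlog = 0, sdlog = 0.8)  # synthetic, positively skewed

# Skewness before and after the log10 transformation
c(raw = skewness(x), log10 = skewness(log10(x)))
```

A value near zero after the transformation indicates a roughly symmetric distribution. Note that `log10(0)` is `-Inf`, so a variable with exact zeros, such as citric.acid, cannot be transformed this way without first handling those values.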

Univariate Analysis

What is/are the main feature(s) of interest in your dataset?

The main feature in the data is quality. I’d like to explore which features determine the quality of wines.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The variables related to acidity (fixed.acidity, volatile.acidity, citric.acid and pH) might help explain the quality of wines. I suspect the different acid concentrations might alter the taste of the wine. Also, residual.sugar affects how sweet a wine is and might also have an influence on taste.

Did you create any new variables from existing variables in the dataset?

I created an ordered factor: quality - classifying each wine sample as ‘poor’, ‘average’, or ‘best’.

Upon further examination of the data set documentation, it appears that fixed.acidity and volatile.acidity are different types of acids; tartaric acid and acetic acid. I decided to create a combined variable, total.acidity, containing the sum of tartaric, acetic, and citric acid.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I addressed the distributions in the ‘Distributions’ section. Boxplots are better suited to visualizing outliers, so I plotted a boxplot of each variable.

In the univariate analysis, I chose not to tidy or adjust any data, except that I plotted a select few variables (the ones that showed positive skew) on logarithmic scales and excluded observations with citric.acid = 0 from the citric.acid plot.

Citric.acid stood out from the other distributions. It had (apart from some outliers) a roughly rectangular distribution, which seems very odd given the wine quality distribution.

Bivariate Plots Section

Let’s check the correlation of all variables to see how each variable relates to quality.

From the correlation plot, quality is most correlated with alcohol and volatile.acidity, followed by sulphates and citric.acid.

The other variables that are highly correlated are: fixed.acidity and volatile.acidity, fixed.acidity and free.sulfur.dioxide, and fixed.acidity and density. Alcohol is also correlated with density.

This suggests that quality is associated with alcohol and volatile.acidity. Since alcohol is correlated with density, and volatile.acidity with fixed.acidity, density and fixed.acidity may also be related to quality indirectly.

Plotting quality vs alcohol and quality vs volatile.acidity to check that the correlations displayed by ggcorr make sense.

The boxplot above shows that alcohol content increases as wine quality increases.

The boxplot above shows that volatile.acidity decreases as wine quality increases.

Plotting quality vs sulphates and quality vs citric.acid to check that the correlations displayed by ggcorr make sense.

The boxplot above shows that sulphates content increases as wine quality increases.

The boxplot above shows that an increase in citric acid goes along with an increase in quality.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Bivariate boxplots, with quality on the x-axis, are interesting in showing trends with wine quality. From exploring these plots, it seems that a ‘best’ wine generally has these trends: lower volatile acidity (acetic acid) and higher alcohol, sulphates and citric acid.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Interestingly, it appears that different types of acid affect wine quality differently; total.acidity showed that the presence of volatile (acetic) acid reduced quality.

What was the strongest relationship you found?

The strongest relationship is between alcohol content and wine quality. The second strongest is between volatile.acidity and quality: the more alcohol and the less volatile.acidity, the better the wine seems to be. There might be interactions among other variables that help predict wine quality, which can be explored through multivariate analysis.

Multivariate Analysis

Checking correlations of each variable with quality of wine applying cor.test:

##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12405165          -0.39055778           0.22637251 
##        total.acidity log10.residual.sugar      log10.chlordies 
##           0.10375373           0.02353331          -0.17613996 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##          -0.05065606          -0.18510029          -0.17491923 
##                   pH      log10.sulphates              alcohol 
##          -0.05773139           0.30864193           0.47616632

The correlation coefficients show that the correlation between wine quality and alcohol content is the highest, with volatile.acidity in second place (the same as observed before with the ggcorr plot), sulphates in third place and citric acid in fourth place. So, it will be interesting to see whether multivariate scatterplots of these variables show any combined effects on quality.
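
A table of correlations like the one above can be produced by applying `cor.test` to each predictor in turn. The sketch below runs on synthetic data with made-up coefficients (not the wine data), and the data frame name `toy` is an assumption.

```r
# Synthetic data for illustration only (not the wine data)
set.seed(1)
n <- 200
toy <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = runif(n, min = 0.2, max = 1.0)
)
toy$quality <- round(
  5.6 + 0.4 * (toy$alcohol - 10.4) -
    1.5 * (toy$volatile.acidity - 0.6) + rnorm(n, sd = 0.6)
)

# Pearson correlation of every predictor with quality
predictors <- setdiff(names(toy), "quality")
cors <- sapply(predictors, function(v) {
  unname(cor.test(toy[[v]], toy$quality)$estimate)
})

sort(cors, decreasing = TRUE)
```

`cor.test` also returns a p-value and a confidence interval for each pair, which helps judge whether a small coefficient is distinguishable from zero.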

Multivariate Plots Section

Let’s plot scatterplots with combinations of variables to see which combinations show a strong influence on wine quality.

Plotting alcohol, sulphates and quality

Sulphates content is higher in better quality wines, and better quality wines also have higher alcohol content.

Plotting alcohol, volatile.acidity and quality

The poor quality and best quality wines show a similar trend, whereas the remaining wines show the opposite trend. This doesn’t make sense, and it means that volatile.acidity and alcohol together don’t provide a reliable trend.

Plotting alcohol, citric.acid and quality

The poor quality and best quality wines show a somewhat similar trend, whereas the remaining wines show the opposite trend. This doesn’t make sense, and it means that citric acid and alcohol together don’t provide a reliable trend.

Plotting citric.acid, volatile.acidity and quality

For every quality level, volatile acidity decreases as citric acid increases. However, the best quality wines have higher volatile acidity than the second-best quality wines, which doesn’t make sense. This suggests that the data is not reliable on this point: there may be some other factor in the best quality wines, unknown to us, that is showing up in this scatterplot.

Plotting sulphates, citric.acid and quality

Sulphates content and citric acid together don’t provide any valuable insight here.

Acidity and pH are related to each other: the higher the acidity, the lower the pH value of a liquid. So, let’s fit a regression model that predicts pH from acidity, and then boxplot the error between the observed pH and the predicted pH for each quality level. If the boxplot shows more error for a certain quality of wine, it suggests that the measured variables might not be the only ones affecting quality.

There is more pH.error in the poor quality wines, which suggests that other factors, such as contamination, may be causing the quality of those wines to be poor.
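
The pH error calculation could look something like the sketch below. The model form (`pH ~ log10(total.acidity)`), the column name `rating` and the synthetic data are all assumptions; the report does not show the formula that was actually used.

```r
# Synthetic data for illustration only (not the wine data)
set.seed(7)
n <- 300
toy <- data.frame(total.acidity = runif(n, min = 5, max = 16))
toy$pH <- 4.2 - 0.9 * log10(toy$total.acidity) + rnorm(n, sd = 0.1)
toy$rating <- factor(
  sample(c("poor", "average", "best"), n, replace = TRUE),
  levels = c("poor", "average", "best")
)

# Predict pH from acidity, then take observed minus predicted as the error
ph_model     <- lm(pH ~ log10(total.acidity), data = toy)
toy$pH.error <- toy$pH - predict(ph_model, newdata = toy)

# Typical size of the error within each quality group
tapply(abs(toy$pH.error), toy$rating, median)
```

A call such as `boxplot(pH.error ~ rating, data = toy)` would then draw the error distribution per quality group. In this synthetic example the rating is random, so no group is expected to stand out.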

Linear model for wine data prediction and errors

Generating a predictive linear model:

## 
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = training_data)
## m2: lm(formula = as.numeric(quality) ~ alcohol + log10(sulphates), 
##     data = training_data)
## m3: lm(formula = as.numeric(quality) ~ alcohol + log10(sulphates) + 
##     log10(volatile.acidity), data = training_data)
## m4: lm(formula = as.numeric(quality) ~ alcohol + log10(sulphates) + 
##     log10(volatile.acidity) + citric.acid, data = training_data)
## m5: lm(formula = as.numeric(quality) ~ alcohol + log10(sulphates) + 
##     pH, data = training_data)
## 
## ==================================================================================
##                               m1         m2         m3         m4         m5      
## ----------------------------------------------------------------------------------
##   (Intercept)              -0.259      0.337      0.257      0.317      1.851***  
##                            (0.232)    (0.234)    (0.226)    (0.226)    (0.500)    
##   alcohol                   0.374***   0.352***   0.312***   0.311***   0.365***  
##                            (0.022)    (0.022)    (0.021)    (0.021)    (0.022)    
##   log10(sulphates)                     1.884***   1.346***   1.472***   1.712***  
##                                       (0.223)    (0.223)    (0.228)    (0.227)    
##   log10(volatile.acidity)                        -1.304***  -1.523***             
##                                                  (0.147)    (0.172)               
##   citric.acid                                               -0.330*               
##                                                             (0.136)               
##   pH                                                                   -0.511***  
##                                                                        (0.149)    
## ----------------------------------------------------------------------------------
##   R-squared                     0.2        0.3       0.3        0.3        0.3    
##   adj. R-squared                0.2        0.3       0.3        0.3        0.3    
##   sigma                         0.7        0.7       0.7        0.7        0.7    
##   F                           285.4      189.0     162.3      123.8      131.3    
##   p                             0.0        0.0       0.0        0.0        0.0    
##   Log-likelihood            -1037.2    -1002.6    -964.7     -961.8     -996.7    
##   Deviance                    488.4      454.3     419.9      417.3      448.8    
##   AIC                        2080.4     2013.1    1939.5     1935.6     2003.4    
##   BIC                        2095.0     2032.6    1963.8     1964.8     2027.8    
##   N                           959        959       959        959        959      
## ==================================================================================

Notice that I did not include pH in the same formula as the acids, to avoid collinearity problems.

The low R-squared values (about 0.2 to 0.3 in the table above) indicate that the linear model is not reliable for predicting the quality of wines.
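
The general workflow behind a model table like this is to split the data, fit nested models on the training part, and check the error on the held-out part. The sketch below is an illustration on synthetic data: the 60% split fraction, the seed and the coefficients used to generate the data are assumptions, not values taken from the report.

```r
# Synthetic data for illustration only (not the wine data)
set.seed(123)
n <- 500
toy <- data.frame(
  alcohol   = rnorm(n, mean = 10.4, sd = 1),
  sulphates = rlnorm(n, meanlog = log(0.65), sdlog = 0.2)
)
toy$quality <- 5.6 + 0.35 * (toy$alcohol - 10.4) +
  1.8 * log10(toy$sulphates / 0.65) + rnorm(n, sd = 0.7)

# Hold out 40% of the rows for testing (split fraction is an assumption)
train_idx     <- sample(nrow(toy), floor(0.6 * nrow(toy)))
training_data <- toy[train_idx, ]
test_data     <- toy[-train_idx, ]

# Nested models: add one predictor at a time
m1 <- lm(quality ~ alcohol, data = training_data)
m2 <- update(m1, . ~ . + log10(sulphates))

# Out-of-sample error of the larger model
pred <- predict(m2, newdata = test_data)
rmse <- sqrt(mean((test_data$quality - pred)^2))

c(r2_m1 = summary(m1)$r.squared, r2_m2 = summary(m2)$r.squared, test_rmse = rmse)
```

One detail to keep in mind when reading the formulas in the table: `as.numeric()` applied to a factor returns the level codes, so for an ordered factor with levels 3 to 8 it yields 1 to 6 rather than the original scores. `as.numeric(as.character(quality))` recovers the original scale.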

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The plots clearly show that the best quality wines are high in sulphates and alcohol content.

Were there any interesting or surprising interactions between features?

It is interesting to see that all these variables are not enough to produce a linear model to predict quality of wine with sufficient accuracy as seen from the linear model developed in the multivariate plot analysis.

Final Plots and Summary

Plot One

Description One

This chart revealed how alcohol has a big influence on the quality of wines. Next time I’m at the supermarket, it’s the first thing I’m going to look for.

Plot Two

Description Two

High alcohol contents and high sulphate concentrations combined seem to produce better wines.

Plot Three

The error in the predictions means that there are missing variables that account for the quality of wines; this data doesn’t seem very reliable for predicting quality.

Reflection

The wine data set contains information on the chemical properties of a selection of wines. It also includes sensorial data (wine quality).

I started by looking at the individual distributions of the variables, trying to get a feel for each one. Single variable analysis helped transform data appropriately to represent a normalized distribution for developing a linear model.

Bivariate analysis displayed some strong, noticeable relationships between individual variables and wine quality. It was clear from the bivariate analysis that alcohol content is a strong predictor of wine quality. Volatile.acidity also strongly influences wine quality. The best quality wines have higher alcohol content and lower volatile.acidity.

Since acidity and pH are related to each other, a regression model was derived to predict pH from acidity. Its boxplot showed the error between the observed pH and the expected pH. The boxplot shows more error in the poor quality wines, which suggests that the poor quality wines may have some contamination that hurts their quality.

On the final part of the analysis, I tried using multivariate plots to investigate if there were interesting combinations of variables that might affect quality. In the end, the produced model could not explain much of the variance in quality. The data is insufficient to produce a better fitting model to predict wine quality with sufficient accuracy.

The difficulty in producing a better predictive model is insufficient data, as the quality of wine largely depends on the process of making it. The tannins and yeast used in production are important aspects, and the aging of wine is also an important factor in predicting its quality. None of these critical aspects are available in this data, hence the poor predictive model.

For future studies, it would be interesting to measure more data variables that affect wine quality for modelling wine quality.